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De novo protein sequencing is essential for understanding cellular processes that govern the fnnc- 
tion of living organisms and all post-translational events and other sequence modifications that 
occur after a protein has been constructed from its corresponding DNA code. By obtaining the or¬ 
der of the amino acids that composes a given protein one can then determine both its secondary and 
tertiary structures through structure prediction, which is used to create models for protein aggre¬ 
gation diseases such as Alzheimer’s Disease. Mass spectrometry is the current technique of choice 
for de novo sequencing. However, because some amino acids have the same mass the sequence 
cannot be completely determined in many cases. Here, we propose a new technique for de novo 
protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and 
measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. 
To calculate the transverse ionic current blockaded by a given amino acid we use a Monte Carlo 
method along with Ramachandran plots to determine the available flow area, modified by the local 
density of ions obtained from molecular dynamics and the local flow velocity ratio derived from the 
Stokes equation. We find that the distribution of ionic currents for each of the 20 proteinogenic 
amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential 
for de novo protein sequencing. 
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Living organisms depend on proteins to carry out the 
genetic code and perform many vital cellular tasks like 
metabolism [1]. To understand how a protein works 
one must understand its structure. Proteins are spe¬ 
cial because of how versatile they are in binding to other 
molecules, and the structure of these binding sites often 
indicate the precise use of a protein. 

The first step in understanding protein structure is 
knowing the sequence of a protein, meaning the order 
of the amino acids that compose it. There are 20 amino 
acids that are used as building blocks by eukaryotic genes 
to make proteins, all of which have the same chain of 
atoms as a backbone. What distinguishes each amino 
acid is its side chain, which can span from a single hydro¬ 
gen in the case of glycine (GLY) to containing an indole 
functional group in the case of tryptophan [T] . For a pro¬ 
tein to function these amino acids fold up into secondary 
and tertiary structures that expose features like bind¬ 
ing sites, which can be predicted based on the protein 
sequence. Ongoing research attempts to understand pro¬ 
tein aggregation diseases such as Alzheimer’s Disease [5] 
by performing simulations of structure formation, which 
would not be possible without the knowledge of the com¬ 
ponents of the peptides and proteins involved. In addi¬ 
tion, protein sequences allow the synthesization of other 
proteins, which is necessary to compensate for diseases 
like Diabetes Type I in which the body does not produce 
the necessary peptide hormone insulin [am!. 

The most common method for de novo protein or pep¬ 
tide sequencing (namely sequencing a protein for the first 
time) is mass spectrometry, a technique that involves 
fractionating the peptide into many smaller peptides and 
then obtaining the mass-to-charge ratio of each new pep¬ 


tide from the mass spectrometer. The problem with this 
technique is that fractionation is often carried out with 
gel electrophoresis, which is inherently slow [^. In addi¬ 
tion, fractionation must be repeated many times to ob¬ 
tain small enough peptides so that one can discern the 
composite amino acids from just the total mass-to-charge 
ratio [5]. Also, de novo sequencing is sometimes impossi¬ 
ble with this technique since some amino acids have the 
same mass and charge {e.g., leucine and isoleucine). 

Edman degradation is another common method for de 
novo protein or peptide sequencing that utilizes repeated 
chemical washing and N-terminal cleaving to identify 
the sequence of amino acids one at a time [Tj. How¬ 
ever, Edman degradation suffers from the same issue 
of fractionation as mass spectrometry since devices can 
only reliably sequence peptides up to about 30 amino 
acids |8|. Nonetheless, the end result of identification via 
chromatography of each singled out chemically modified 
amino acid is reliable, albeit slow, but does require the 
use of many reagents. 

The advent of nanopore DNA sequencing PHD] has 
brought several modern techniques to protein detection: 
longitudinal ionic transport [111 112) and transverse elec¬ 
tronic transport [13) . In the case of ionic transport 
through a single nanopore, detection of the protein fold¬ 
ing state is achieved experimentally and modeled with 
exclusion volumes by HU. Of course, protein sequencing 
with such a technique is a more difficult task and has not 
been achieved as of yet [T2|. In fact, longitudinal ionic 
transport detects a current blockade which is the con¬ 
volution of several blockade events from different amino 

acids m- 

Transverse electronic transport, a technique in which 
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FIG. 1. (Color online) A schematic of the transverse ionic transport sequencing method. Two nanochannels intersect: the 
vertical or longitudinal channel along 2 with radius R and the horizontal or transverse channel along y with radius r. The 
polypeptide translocates along the longitudinal channel crossing the transverse channel that contains ions, purple K"*" and 
green Cl“, that flow along the transverse channel due to an electric field, E± , in the +y direction. In this case the polypeptide 
consists of neurokinin A starting at the C-terminus at the top of the figure attached to one cysteine (CYS) followed by 10 
glutamic acids (GLUs), a negatively charged amino acid, where the last GLU makes up the N-terminus (see section Sequencing 
Protocol for more on this structure). This negatively charged polypeptide is driven towards —2 by an electric field, i5||, in the 
+2 direction. The dotted lines represent the top and bottom extremities of the intersection of the transverse channel, which 
are expanded to the right along with the thick dashed lines representing the area-limiting cross section (outer black line) and 
the Monte Carlo radial limit (inner blue line) that lie in the a: 2 -plane. For visibility purposes the polypeptide is enlarged by 
a factor of 3 in both of its dimensions from the actual scale that we used in simulations while the ion radius is enlarged by a 
factor of 1.5. 


amino acids are detected by a pair of electrodes trans¬ 
verse to peptide translocation, has been shown to be suc¬ 
cessful in identifying single amino acids and even in dif¬ 
ferentiating between tyrosine and phosphotyrosine [13] , a 
post-translational modification. However, only 12 of the 
20 amino acids were able to be detected by this technique 
with two different electrode gap distances (0.55 nm and 
0.7 nm) since the tunneling current is highly dependent 
on this gap distance and an amino acid’s ability to enter 
the gap. In other words, a single gap cannot be used for 
all amino acids. 

This brings us to our proposed technique, sequenc¬ 
ing proteins with transverse ionic transport. Like the 
two aforementioned techniques, this method is inspired 
by a DNA sequencing method [HI [15] and does not re¬ 
quire reagents or fractionation since these devices do not 
place a limit on the length of the polypeptide uni , mean¬ 
ing these nanopore techniques have the potential to be 
much faster. The structure of this proposed device is 
the same as in [HKii], with a longitudinal nanochannel 
for polypeptide translocation and an intersecting trans¬ 
verse nanochannel for ionic transport driven by an elec¬ 
tric field, E±, as in Fig. However, the longitudinal 


nanochannel must be larger than in |14] to accomodate 
the various sizes of the amino acids and instead of 4 DNA 
bases we need to distinguish 20 amino acids. Therefore, 
the molecular dynamics (MD) simulation method utilized 
in m is time prohibitive for our purposes so we resort 
to a hard sphere model to account for the electrostatic 
properties of each amino acid, which requires only one 
MD run per amino acid to execute. Afterwards we use 
Monte Carlo sampling to calculate ionic current distri¬ 
butions based on external azimuthal rotations {(j)') and 
dihedral angle and ip) distributions, or Ramachandran 
plots. We show that the distribution of ionic currents for 
each of the 20 proteinogenic amino acids encoded by eu¬ 
karyotic genes is indeed statistically distinct, and propose 
a protocol for de novo protein sequencing based on this 
technique. 


THEORETICAL APPROACH 

Let us then consider the configuration of crossed 
nanochannels we have in mind. Although not neces¬ 
sary for our conclusions, we assume for simplicity the 
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nanochannels to have circular cross sections. We will dis¬ 
cuss the suggested experimental preparation later in the 
manuscript, in the section titled Sequencing Protocol. 

The polypeptide of interest unfolds inside a nanochan¬ 
nel pulled with a longitudinal force, while it blocks the 
ionic current flowing in a transverse channel, as schemat¬ 
ically shown in Fig. 

It is well understood that the hydration layers sur¬ 
rounding each amino acid have different binding energies 
[Mini, which certainly affect the ionic transport trans¬ 
verse to each amino acid. In addition, the amino acid 
may attract or repel ions due to its solvated charge or 
polarity state [HI [19]. In order to understand the aque¬ 
ous environment of each amino acid and determine its 
effect on the ionic transport, we run MD simulations for 
each amino acid. We consider the system at normal hu¬ 
man body temperature, 310 K, and the solvated system 
is large enough to make quantum effects negligible. This 
allows us to use classical molecular dynamics and employ 
the highly-parallel NAMD2 [50] to run all of our simula¬ 
tions. 

The MD setup starts with a single amino acid isolated 
from a straight (dihedral angles t/j = cj) = 180°) peptide 
chain, as in Fig. [^with proline (PRO) as an exception, 
which is positioned so that the 2 ;-axis is the longitudinal 
axis. The rest of the MD methods can be found in the 
Supporting Information. 

The water padding is large enough in this system to 
examine proximal radial distribution functions (pRDFs) 
from the amino acid’s surface for K+ and Cl“ up to the 
point where the concentrations level out to the bulk val¬ 
ues. We use the radius from the surface of the amino 
acid because the features in the concentration will be 
more prominent as opposed to using the radius from the 
origin, since the amino acids have irregular shapes. Sim¬ 
ilarly calculated pRDFs on DNA have been shown to be 
fairly accurate for reconstructing the surrounding solute 
even when combining all surface atoms’ pRDFs into one 
[mnn, as is done in our calculations. 

To obtain the pRDFs, we count the number of ions 
(for K+ and Cl“) in 0.5 A thick shells starting from the 
surface of each amino acid, which is defined by the in¬ 
tersection of the composing atoms’ van der Waals (vdW) 
spheres. We then calculate the volume of each shell by 
subtracting the inner volume of the intersecting spheres 
from the outer volume, using a grid approximation with 
0.1 A sides for each volume calculation. With the num¬ 
ber of ions and the volume of the corresponding shell we 
calculate the local concentration of K+ and Cl~ as a func¬ 
tion of r>, taken to be the perpendicular distance from 
the vdW surface to the radial midpoint of the shell, from 
the first shell at r> = 0.25 A to the last at = 44.75 A, 
which is below the 4.8 nm upper bound of water padding. 

As can be seen in Fig. [^ the concentrations reach a 
sufficiently steady bulk value at varying radii, with the 



FIG. 2. (Color online) Plots of ionic concentration against 
distance from each amino acid’s vdW surface, r>, for amino 
acids GLU, LYS, and MET. is represented by the purple 
line and Cl~ is represented by the green line. 

maximum bulk r> determined to be approximately 15 A. 
Therefore we can focus on the part of the plots pertain¬ 
ing to r> < 15 A to determine the solvation properties 
of each amino acid. As an example of our numerical pro¬ 
cedure, we have chosen to feature the amino acid GLU 
in Fig. which has a negatively charged side chain at 
physiological pH (7.4), lysine (LYS) in Fig. [^, which 
has a positively charged side chain at the same pH, and 
methionine (MET) in Fig. HP, whose side chain is hy- 
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drophobic at this pH. These three amino acids are of 
similar size, which allows us to better compare the ef¬ 
fects of charge states on transverse ionic current. We 
can immediately notice that the part of the pRDFs that 
we care about is quite different for each featured amino 
acid. GLU in Fig. IK has a higher concentration of K+ 
due to its negativity while LYS in Fig. has a higher 
concentration of Cl~ due to its positivity. Then there is 
the hydrophobic MET in Fig. IP, which appropriately 
repels both K+ and Cl~ without much preference. 

In the setting of an external electric field driving trans¬ 
verse ionic flow around an amino acid within a peptide, 
the potential barrier that ions must overcome in trans¬ 
port is influenced mostly by the electric potential in the 
neighborhood of the area-limiting cross section perpen¬ 
dicular to the ionic flow, imaged in Fig. as the black 
thick-dashed line. This is partly due to the short inter¬ 
ference time between the flowing ions and the circum¬ 
vented amino acid. In our theoretical approach, we treat 
the equilibrium ionic concentrations as indicators of this 
electric potential to develop a hard sphere model with 
which we can calculate the distribution of ionic current 
for each amino acid. By calculating an effective radius, 
Teff, that is applied to every atom in the amino acid be¬ 
yond its vdW radius, we can sample many amino acid 
orientations using a Monte Carlo approach to determine 
all of the ionic current distributions. We theorize that 
most of the variation in the transverse ionic transport 
will come from the exclusionary effects of the amino acid 
with respect to the direction of ionic flow, meaning that 
a large pool of orientations must be sampled to obtain 
an accurate view of these distributions. 

In order to obtain the effective radius for each amino 
acid, we start with the definition of the average transverse 
ionic current of an ionic species i, K+ or Cl~ in our case, 
with valency Zi flowing around an amino acid. 

(I^) = qzi [ [ {g^v^(r,e'{r'),(j)'{f')))f,rA6dr 

Jn JSi 

r^f 

= C {giVi)fi{r)rdr 


fluence compared to neighboring amino acids, is the 
radius where gi, the local number density as a function 
of spherical coordinates, is first nonzero, rf is the radius 
where the influence of the amino acid is no longer felt 
in the concentration and thus we need not continue the 
integral for the purpose of the effective radius, res, cal¬ 
culation. Vi is the transverse velocity through the cross 
section as a function of standard spherical coordinates 
while A(r) is the surface area of the sphere of radius r. 
In addition, is the perpendicular distance from the 
vdW surface to the radial midpoint of a shell of thick¬ 
ness dr> and surface area H, Vo is the average radius from 
the origin to the vdW surface, is a value of r> where 
the pRDF, gi (plotted in Fig. |]), has become sufficiently 
steady around the bulk density, gi^^, so as to represent a 
shell in the bulk, Vi is the flow velocity of the ion species 
j as a function of r>, while Vi^b is the maximum of Vi, 
which occurs in the bulk by construction. 

The first approximation that we make is that all of the 
rotational orientations are uniformly likely, when in real¬ 
ity O' is fairly constant due to the stiffness of the peptide 
bond and given how small the diameter of the pore is 
in comparison to the length. However, when we average 
over (j)' we fully explore the number density around the 
shell, so averaging over O' does not introduce any new 
data but adds more weight to the side chain as opposed 
to the ends of the backbone. This counteracts the simpli¬ 
fication we make in our MD runs where we use isolated 
amino acids and include the number density at the ends 
of the backbone, which would normally be expelled by 
the nearest neighbor amino acids. Also, the internal dihe¬ 
drals are assumed fixed since they do not fluctuate much 
under the imposed longitudinal electric field (see their 
implementation in the current distribution calculations). 
Lastly, when we change variables from r to r> we have to 
approximate r as -1- Tq , which is a minor approxima¬ 
tion when considering that all of the other functions in 
the integral have well-defined transformations. We can 
now use the following simplified equation to calculate the 
effective radius for our hard sphere model for every amino 
acid and ion species combination: 


c [ 9i{r>)vi{r>) 

JO 


( 1 ) 


2(r>+r,)*> 
A{r>) 




Here, the identifying transverse ionic current, li, through 
the aforementioned area-limiting cross section perpendic¬ 
ular to the ionic flow, is averaged over all rotational ori¬ 
entations equally with f' representing the unit r' vector 
of the amino acid while g is the electron charge and C, 
C, and C" are constants. [0i,0f] is the window of 0 
where the amino acid under study has non-negligible in¬ 


r Mr>) ^ 

X^,2(r>+ro) > 

r*” 9^{r>)vi{r>) A(r>) 

Jo 9i,h Vi^h 2(r>-|-ro) 

However, this equation requires the ratio of the trans¬ 
verse flow velocity compared to the bulk, and due to the 
small length scales we can use the Stokes equation, simi¬ 
lar to [53]. The details of this calculation can be found in 
the Supporting Information. From these calculations we 
find that ro = (R —ro)/2 and then from our pRDF plots 
(see Fig. we learn that the bulk concentrations start 
at approximately rb > 15 A. Therefore for our model to 
work we have to take R > 30-|-max{ro} = 34.16 A, where 
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FIG. 3. (Color online) The top graph represents area plots 
of Cl“ around LYS while the bottom graph shows area plots 
of K’^ around LYS. The straight magenta line is the average 
cross-sectional area that the shell of thickness Ar> = 0.5 A 
at r>, the distance from the vdW surface of LYS, occupies in 
the plane y = 0. The blue line represents the average cross- 
sectional area that the ionic solution, with the number of ions 
from the shell of thickness Ar> = 0.5 A at r>, would occupy 
in the plane y = 0 ii those ions were reorganized to have bulk 
concentration, gt^b- The smooth green curve is the blue curve 
modulated by the ratio of the velocity with its maximum, 
Vijvi^b, which is plotted in the inset of each graph. The area 
under the smooth green curve is equal to the shaded gray area 
under the straight magenta curve while the dashed vertical 
line marks the effective radius for ion species i specifically for 
LYS. 


the max is over all amino acids, and then in the interest 
of minimizing the bulk ionic current we choose R = 35 
A. We also set the transverse nanochannel radius to the 
same value for simplicity. 

The insets of Fig. show the results of our calcula¬ 
tions for Vifvi^h] the top graph represents Cl~ around 
LYS while the bottom graph shows K+ around LYS. The 
other amino acids have similar parabolic forms for Vi/vi^b, 
but differing because of differing Tq. With Vi/vi^b cal¬ 
culated for every amino acid we can return to Eq. (§ 
to calculate our effective radii for our hard sphere model. 


This calculation is shown graphically in Fig. where 
the straight magenta line is the argument (including dr> 
as Ar> = 0.5 A) of the left-hand side of Eq. ([^, which 
is the average cross-sectional area that the shell of thick¬ 
ness Ar> at r> occupies in the plane of interest {y = 0). 
The blue line represents the argument of the right hand 
side of Eq. ([^, again including dr> as Ar> = 0.5 A 
without the modulation of the velocity ratio, leaving the 
average cross-sectional area that the ionic solution, with 
the number of ions from the shell of thickness Ar> at 
r>, would occupy in the plane of interest (y = 0) if those 
ions were reorganized to have concentration gi^b- Finally 
the smooth green curve is the blue curve modulated by 
Vi/vi^b- The area under the smooth green curve is equal 
to the shaded gray area under the straight magenta curve, 
with the dashed vertical line marking not only where the 
shaded gray area ends on the left but also the effective 
radius for ion species i for the given amino acid. Be¬ 
cause of the influence of the velocity, the fluctuations in 
concentration farther from the amino acid have more ef¬ 
fect than closely bound spikes. For example, the Cl“ 
ion atmosphere located 1 A from the surface of LYS has 
less effect on the effective radius compared to the next 
spike in concentration further out from the amino acid, 
as seen in the green curve. The fact that LYS is pos¬ 
itively charged still shows in the effective radii though, 
with the attractive Cl“ ions having a 5.38 A addition to 
the vdW surface compared to 6.43 A for the repulsive 
K+ ions. The rest of the effective radii can be found in 
Table of the Supporting Information. 

This brings us to our Monte Carlo calculation of the 
transverse ionic current around each amino acid. Now 
that we have res for each ion species that we add to the 
vdW radius of every atom in our amino acid, we can 
compare the available cross-sectional area through the 
y = 0 plane and apply the same bulk concentration and 
estimated bulk velocity, yb = 1 M and Vb, to all amino 
acids to obtain the ionic current values. We do not need 
to evaluate the available area in the entire cross section 
though since we only need to calculate up to the largest 
radius determined by reg for all amino acids. Therefore 
we use a radius of R/2 from the origin (see Fig. [^where 
we are now limited to r = R) as the circular boundary 
for all of the amino acids since this circle encloses all of 
the extended amino acid surfaces in any applicable ro¬ 
tational configuration while also being enclosed by the 
bulk boundary defined by r> = rb where the velocity 
begins to decline from z)b. We also approximate [0i,0f] 
as [tt/L, 37r/4] by comparing the backbone ends’ vdW ra¬ 
dius to half of the distance in z between amino acids (half 
of ideally ^ 3.8 A [24]). In this manner we can ignore 
portions of the cross section that would clearly be dom¬ 
inated by neighboring amino acids for the purposes of 
understanding each amino acid’s transverse ionic trans¬ 
port signature. 

As previously mentioned, the current becomes sensi- 
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tive to rotational conformations and dihedral angles in 
this portion of the calculation. Therefore, instead of as¬ 
suming uniformity in 0' and straight dihedral angles like 
we did for the effective radius, we fix 9' to 0 due to the 
rigidity of the peptide bond and we use Ramachandran 
plots, [23[in]) to sample realistic values for (j) and di¬ 
hedral angles, which encompass the internal degrees of 
freedom for a chain of amino acids [IS [27] . That leaves 
the azimuthal angle, (j)' ^ which we leave as uniformly dis¬ 
tributed since as a whole the peptide does not have an 
azimuthal preference, except if the peptide is very short 
in which case the transverse electric held that is only ap¬ 
plied to a few amino acids can affect the entire chain. 
We then apply Monte Carlo to a lone amino acid, the 
details of which can be found in the Supporting Infor¬ 
mation. The reason we use a lone amino acid, the same 
one from our MD simulations, for calculating the ionic 
current distributions is that the hrst step to understand¬ 
ing the viability of this technique is distinguishing each 
amino acid separately via transverse ionic current. Since 
most of the exclusion due to the amino acid comes from 
the region of small z, where the uniqueness of the amino 
acid is demonstrated, the exclusion from one amino acid 
in a chain can be derived from our single amino acid 
distributions. As a result we do not treat the effect of 
neighboring PRO, which alters the dihedral angles so as 
to straighten the polypeptide chain. However, changing 
an amino acid’s dihedrals slightly does not change the 
ionic current distributions much since most of the varia¬ 
tion in the current comes from azimuthal rotation of the 
amino acid. 

Lastly, we must calculate the bulk velocity, Vh, that we 
will use in the simple equation for the transverse ionic 
current, h = qZig\,v\,{Ai) and / = where (A*) 

is the average area outside of the effective surface from 
Monte Carlo. This calculation can be found in the Sup¬ 
porting Information, resulting in Vh = 77.23 m/s. 

RESULTS AND DISCUSSION 

With a set of ionic currents for each amino acid deter¬ 
mined from Monte Carlo utilizing our hard sphere model, 
we histogram each set of currents and use cubic spline in¬ 
terpolation to arrive at Fig. [^ The ionic currents tend 
to form multimodal (most often bimodal) distributions 
that are best described as a mixture of several normal 
distributions. The first and last peaks of each distri¬ 
bution tend to be the highest due to the variation in 
(j)' . This is because the ionic current as a function of (j>' 
is roughly sinusoidal with a period of tt and </>' is uni¬ 
formly distributed, which means the near minimum and 
near maximum values of the ionic current are chosen the 
most. Also due to the size of the nanochannels, the ionic 
current ranges in the tens of nA, which is well within the 
range of modern measurement devices that can resolve 


pA currents [13 Beyond that, this ionic current 

only represents up to R/2 of the whole cross section. By 
using the parabolic v from the bulk region (see the end of 
Sequencing Protocol) we calculate the contribution from 
the rest of the cross section, r> > but still within the 
6 limitations, as 69.86 nA after correcting the velocity for 
experiment. This value is comparable to the ionic current 
values from Fig. [^ meaning the distinctive component 
of the ionic current will not be dwarfed by the bulk in an 
experimental setting. 

Although a fair number of amino acids do not deviate 
much from their vdW size identity, namely PRO remain¬ 
ing on the smaller side (large current) and phenylalanine 
on the larger side (small current), many more {e.g., ala¬ 
nine) have shifted due to their interaction with the ions. 
However, the vdW volume does remain strongly relevant 
in the standard deviation of the distributions, where the 
larger amino acids (arginine, phenylalanine, tryptophan, 
tyrosine) find more variation in ionic current as the di¬ 
hedrals or (j)' are altered. 

At a glance there is significant overlap between all of 
the distributions, yet the graph seems crowded mostly 
because of the sheer amount of plots to compare. We 
quantify the distinguishability of the ionic current distri¬ 
butions by calculating the error in selecting the correct 
amino acid. A, given M measurements from X. Based 
on the maximum likelihood decision rule [29], the error 
is defined by 



M \ 

m—l / 


where J is the total number of realizations of the error 
calculation, {Y} is the set of all 20 amino acids, H is the 
Heaviside step function, and is the probabil¬ 

ity of Imji the jth realization of the mth ionic current 
measurement sampled from the current distribution for 
A, in A’s ionic current distribution. Here, we assume 
that each measurement of ionic current is approximately 
independent. Next we average over A to obtain {e^)x 
and then multiply by 100 to get the error percentage, 
which is plotted in the inset of Fig. [^ The error drops 
at a moderate rate with increasing M, but significantly 
drops off for M > 160 when the likelihood of at least one 
measurement giving zero probability to incorrect amino 
acids becomes very likely, making the product of those 
incorrect probabilities zero. For instance, at M = 175 
the error percentage is practically 0%, and certainly less 
than 0.1%, a reasonable level of error. With a measure¬ 
ment frequency of 100 kHz, [55]) and a best case scenario 
of 175 measurements per residue without any lapses in 
between, the sequencing rate becomes 571 residues per 
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FIG. 4. (Color online) The transverse ionic current distributions for all 20 proteinogenic amino acids encoded by eukaryotic 
genes (identified with their standard three-letter abbreviations). The distributions have been normalized to the current values 
in nA. The inset plots the average error percentage over all 20 amino acids of identifying an amino acid correctly using M 
current measurements from that amino acid where the error percentage is on a log scale. 


second. 


SEQUENCING PROTOCOL 

To build a nanofluidic device with intersecting channels 
as we suggest one may employ focused ion beam milling, 
as achieved in m with two 10 nm diameter intersect¬ 
ing nanochannels. Our model requires two 7 nm diame¬ 
ter intersecting nanochannels, which is certainly achiev¬ 
able given that |30| has shown non-intersecting sub-5 nm 
nanochannels from the focused ion beam milling tech¬ 
nique. Although we have predicted that all 20 amino 
acids are statistically distinct within the framework of 
circular channels, other cross sections like rectangles or 
ellipses for the transverse channel allow fewer amino acids 
to blockade the ionic transport but still provide enough 
space for ions to flow past the translocating polypeptide. 
This results in improved residue selectivity and there¬ 
fore decreased error as well as reduced post-processing 
time for deconvolution of the amino acid signals, which 
is necessary if more than one amino acid resides in the 
nanochannel intersection. Since the source of the dis- 
tinguishability of the amino acids is their structural and 
electronic uniqueness we can assume that using a rect¬ 
angular or elliptical transverse cross section with enough 
space along x for ionic flow would also result in 20 sta¬ 
tistically distinct amino acids. 

Once the sequencing device is built with transverse 
electrodes to control ionic flow, the protein or polypep¬ 
tide of choice must be unfolded to translocate it through 
the longitudinal nanochannel. By using a high enough 
pulling force, around 250 pN PUSH that we also ap¬ 
ply to our model, the polypeptide will unfold as well 
as translocate through the nanochannel. As opposed to 


chemical denaturing, force unfolding results in more con¬ 
fined and reliable Ramachandran plots PI EH, which 
directly translates to more reliable ionic current distri¬ 
butions. After the polypeptide is unfolded the pulling 
force can be adjusted according to one’s ionic current 
measurement frequency and desired rate of error. For ex¬ 
ample, a desired 0.1% or less of error requires M = 175 
and with a sequencing rate of 100 kHz as before, the 
maximum pulling speed would be 217 nm/s assuming an 
amino acid length of 3.8 A. As a result, the maximum 
applicable pulling force would be ^ 180 pN [5T] . 

The next issue is then how this polypeptide is pulled 
through the nanochannel. As we have discussed, amino 
acids have varied charge states in solution. Therefore, to 
utilize an electric field for pulling (see Fig. one has to 
attach charges to the polypeptide. These charges must be 
attached at the end of the chain so that one does not in¬ 
terfere with the ionic transport signatures of each amino 
acid. The best way to achieve this is by using a combina¬ 
tion of solid phase peptide synthesis (SPPS), which excels 
at synthesizing smaller peptides |32j , and native chemical 
ligation (NCL) [ 33 ] to attach a sequence of charged amino 
acids to the N-terminus of the polypeptide under study. 
We choose GLU as our charged amino acid because 
of how easily differentiable it is from the other amino 
acids (see Fig. and how easy it is to produce. Us¬ 
ing Fmoc, 9-fluorenylmethyloxycarbonyl or the chemical 
group that protects the N-terminus from reactions until 
desired, SPPS starting with N,N-bis(2-mercaptoethyl)- 
amide (BMEA) [ 31 ] one creates a sequence of GLU with 
a length that will give the polypeptide chain plus GLU se¬ 
quence a large enough charge to pull with an electric field. 
Fmoc SPPS is also used to attach a GYS residue to the 
N-terminus of the unknown polypeptide with a polyethy¬ 
lene glycol (PEG) support [ 33 ]. Then one uses NCL to 








take advantage of the transthioesterification reaction to 
form a native amide bond between the N-terminal CYS 
residue and the thioester precursor BMEA |34) . 

Another option is to use optical tweezers [Ml Et] to 
target a terminal amino acid to pull the whole polypep¬ 
tide. This approach has been utilized for longitudinal 
nanopore DNA sequencing [MUM]; resulting in more con¬ 
trol over translocation due to the high tunability of opti¬ 
cal tweezers. Advances in optical tweezers further allow a 
single beam to trap multiple targets [40] , potentially with 
computer-generated holograms im, which would allow 
even more control over the entire polypeptide. 


SUMMARY 

We have proposed a novel de novo protein sequenc¬ 
ing method in which an unfolded protein confined to 
a nanochannel is probed by transverse ionic transport 
through an intersecting nanochannel. This method 
promises to offer improved discrimination between amino 
acids by utilizing the 3-dimensional structure and elec¬ 
tronic properties of each amino acid, as compared to 
techniques like mass spectrometry that can only probe 
total mass and charge [6]. We developed a hard sphere 
model for transverse ionic transport that employs the av¬ 
erage equilibrium ionic concentrations surrounding all 20 
amino acids derived from MD and ionic flow ratios de¬ 
termined by the Stokes equation. With this hard sphere 
model we were able to calculate distributions of ionic cur¬ 
rent for each amino acid based on Monte Carlo sampling 
of internal and external rotational conformations. All 20 
amino acids were found to be statistically distinct and a 
sequencing error rate per residue of less than 0.1% was 
obtained with M = 175 measurements per amino acid, 
implying a best case scenario of 571 residues per second 
with a measurement frequency of 100 kHz [28] . 

This approach is certainly experimentally achievable 
since 10 nm diameter intersecting nanochannels have 
been demonstrated for the purpose of DNA sequenc¬ 
ing m and polypeptides can be pulled through the 
nanochannel with optical tweezers or by adding charged 
residues to the polypeptide terminus and employing an 
electric held. Protein sequencing is very important since 
DNA sequencing cannot predict post-translational modi- 
hcations and the ability to identify the sequence of a pro¬ 
tein leads to the ability to understand its structure, which 
is the key to understanding many crippling diseases like 
Alzheimer’s |5]. We therefore hope our work will moti¬ 
vate the experimental realization of the proposed protein 
sequencing protocol. 
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SUPPORTING INFORMATION 


Molecular Dynamics The amino acid is centered 
along the z-axis according to the geometric center in z of 
its terminal N and C atoms, while the molecule is cen¬ 
tered in the a:?/-plane according to the geometric center 
in X and y of its terminal N atom and a nearest neigh¬ 
boring amino acid’s terminal N atom. To fix the rotation 
angle between amino acids, as a convention, the terminal 
N atoms always have ?/ = 0, as is the case in Fig. [TJ 

Since PRO has more rigid dihedral angles, we need 
to center it with the help of two neighboring GLYs, 
which have flexible dihedral angles, on each side of a sin¬ 
gle PRO. The nearest neighbor GLYs are configured to 
have dihedral angles that compensate for those of PRO 
while the farthest neighbor GLYs are configured to be 
straight, so that PRO and the two straight GLYs are di¬ 
rected along the longitudinal axis while the two straight 
GLYs are aligned in the ccy-plane. As a result, we can 
center and then isolate PRO by using the usual center¬ 
ing method on the geometric average of the two straight 
GLYs, staying consistent with the choice of angles for the 
rest of the amino acids. 

Once the amino acid is isolated, we solvate the system 
into a right hexagonal prism with regular hexagonal xy- 
planes having a height of 11 nm and an apothem of 5.9 
nm to be used in NAMD2 with periodic boundary condi¬ 
tions in all three dimensions of space. This configuration 
gives every atom from every amino acid at least 4.8 nm 
of water padding in the unit cell, or in other words at 
least 9.6 nm of water between any atom and the closest 
atom in any neighboring periodic image. We then passi¬ 
vate and ionize the system to about 1 M of KGl, a typ¬ 
ical biological solute. The size of the ions will certainly 
change the average local concentrations near the amino 
acid, which may then affect the ionic transport. We uti¬ 
lize the GHARMM22 with CMAP force field |3H |33] for 
all of the amino acid, TIP3P water, and ion interactions. 
Each amino acid was held fixed throughout the run so 
that it would not diffuse around and the surrounding 
solution could equilibrate and be analyzed consistently. 
After equilibrating at 0 K and progressively ramping up 
the temperature to 310 K, the system is allowed to evolve 
in an NPT ensemble first for 1 ns followed by an NVT 
ensemble for 5 ns, all with 1 fs time steps and 1 ps co¬ 
ordinate recordings. The temperature is held fixed using 
a Langevin thermostat with a damping coefficient of 5 
ps“^. The first ns of the NVT production run is discarded 
as transient, leaving 4 ns of run time, or 4000 coordinate 
snapshots, to analyze radial concentration profiles. 

Velocity calculation Due to the small length scales of 
the intersecting portion of our nanochannel system, we 
can use the Stokes equation. 


da: 



-I- qZigi{x)E± = 0, 


(SI) 


where g is the dynamic viscosity of the fluid, E± is the 
external electric field applied in the y direction, Vi is 
the transverse velocity through the cross section, gi is 
the local number density, and ion-ion interactions are ig¬ 
nored. Here the flow velocity is independent of y due to 
the fact that the transverse nanochannel length is much 
larger than the diameter of the longitudinal nanochan¬ 
nel, 2R, and that 2R is comparable to the diameter of 
the transverse nanochannel (as depicted in Fig. [^. In¬ 
dependence from z is similarly due to the longitudinal 
nanochannel’s length being much larger than 2R but we 
must also choose R to be large enough for ions to dif¬ 
fuse along z after they enter the longitudinal channel at 
y = ±R. This will make any variation along z have neg¬ 
ligible impact on the end result of an average Vi over all 
rotational conformations. In our case we simply use the 
dynamic viscosity of water (/i = 7.5 x 10“^ Pa • s), even 
though the viscosity of water with ions will vary slightly 
[23] , and a reasonable value of E± = 5 x 10® N/G taken 
from |23] . However, we must transform this equation into 
one that depends on r> to obtain Vi. Since Vi and gi are 
averages over all rotational orientations, the problem is 
condensed to the region x > 0 and [9i,9f]. With 9i and 
9f close enough to tt/2, we can approximate r> as x —Xo, 
where Xo is some constant, since at least for the amino 
acid backbone the contour lines of r> resemble those of 
X. With these approximations we obtain 

dr^ + qZigi{r>)E± = 0. (S2) 

This will give us the rough form of Vijvi^\y between the 
following boundary conditions: 


yi{r> = 0 ) = 0 

Vi{r> = r> = ^ > v^{ry). 


(S3) 


f> = {R — ro)/2 is approximately halfway between the 
vdW surface and the longitudinal nanochannel surface 
and is also our upper bound on r> as the domain of Vi. 
To obtain Vi for the entire range necessary for Eq. Q, 
we require that r\^ = = {R — ro)/2 since r> must be 

in the bulk as well. From our pRDF plots (see Fig. we 
learn that the bulk concentrations start at approximately 
r> = 15 A, meaning that Xb > 15 A. Therefore for our 
model to work we have to take R > 30-f maxjro} = 34.16 
A, where the max is over all amino acids, and then in the 
interest of minimizing the bulk ionic current we choose 
R = 35 A. 

Monte Carlo The Ramachandran plots that we use in 
our Monte Garlo calculations account for a 250 pN lon¬ 
gitudinal force that is applied to the polypeptide chain 
(ubiquitin and polyglycine in [25]) to pull it through the 
nanochannel. The pulling force acts to limit the phase 
space available to the dihedral angle pair (<(),■!/:), making 
the configurations that are close to straight [i/; = (j) = 
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180°) much more appealing PRO and GLY have sig¬ 
nificantly different plots from the rest of the amino acids 
due to how the side chain of PRO bonds with its own 
amine nitrogen, part of the amino acid backbone, leading 
to restricted dihedrals while GLY has a hydrogen instead 
of a side chain leading to more freedom in the dihedrals. 
This way, while the rest of the amino acids are described 
by the Ramachandran plot of ubiquitin, which contains 
all 20 of the proteinogenic amino acids encoded by eu¬ 
karyotic genes and well represents 18 of them, we describe 
PRO with the {(p, ip) plol from the isolated PRO values 
within ubiquitin and GLY with the Ramachandran plot 
of the polyglycine analog of ubiquitin [25]. With these 
Ramachandran plots we use Monte Carlo sampling to ob¬ 
tain {(p, Ip) pairs that we then implement on a lone amino 
acid, where the number of realizations is dependent on 
the size of the domain of the Ramachandran plot (1408 
realizations for ubiquitin). We also rotate the amino acid 
in <p' G [0, 27r) by all multiples of 7r/12. Then the amino 
acid is projected onto the y = 0 plane and using Monte 
Carlo (1000 realizations) we calculate the area outside 
of the effective surface, called Ai, yet within either 7r/2 
sector of radius R/2 centered around z = 0, where the 
ion i will fit according to its vdW radius. 

Maximum velocity Since Vb is the maximum velocity 
between the amino acid and the longitudinal nanochannel 
surface as aforementioned, we can use Eq. (SI I to obtain 
the max of v, which is equivalent to Wb. In this case we 
focus on the velocity within the bulk region, namely from 
the midpoint between the amino acid and the channel 
surface, x = a^mid = {R + ?'o)/2, to the channel surface, 
X = R. We employ the following boundary conditions, 


){x = i?) = 0, 


V{X = Xmid = -X-) > V{x), 


(S4) 


which are very similar to Eq. (S3). By assuming a con¬ 
stant bulk concentration, g\,^ over this region we quickly 


come to a parabolic solution to Eq. (SI) as well as de¬ 
termining Vh = qzghE±{R — ro)^/4/r, where Zi has been 
simplified to z since both ion species have the same va¬ 
lency. Then we have Vh = 154.47 m/s by choosing a 
reasonable ro = 4 A, a necessity in making Vh indepen¬ 
dent of the amino acid under study, which is more likely 
in experiment. In fact, the absolute value of the velocity 
determined from the Stokes equation is known to differ 
from experiment, |23| . as opposed to the velocity ratio 
that we have utilized thus far. However, these differ¬ 
ences appear to be systematic, |23|, and can be solved by 
dividing Vh in half, resulting in the corrected Vh = 77.23 
m/s. 
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TABLE SI. Effective radii in A 


Amino 

Acid 

reff 

for Gr 

reft 

for K+ 

ALA 

8.28 

7.68 

ARG 

3.50 

4.49 

ASN 

5.23 

5.55 

ASP 

6.15 

4.69 

CYS 

7.15 

6.85 

GLN 

4.85 

5.28 

GLU 

8.03 

7.44 

GLY 

6.18 

5.75 

HIS 

6.30 

6.08 

ILE 

5.25 

4.78 

LEU 

5.97 

5.74 

LYS 

5.38 

6.43 

MET 

4.92 

4.73 

PHE 

7.87 

7.84 

PRO 

5.02 

4.43 

SER 

6.85 

6.99 

THR 

4.15 

4.57 

TRP 

6.59 

6.74 

TYR 

6.93 

7.29 

VAL 

5.37 

5.27 



